Grammar-based Corpus Annotation

Author

  • Stefanie Dipper
Abstract

There is an increasing number of linguists interested in large syntactically annotated corpora (treebanks). Such corpora can serve as a base for statistical applications and, at the same time, may be used in theoretical linguistics as a source for investigations about language use. The most important treebank nowadays is the Penn Treebank (Marcus et al., 1993; Marcus et al., 1994). Many statistical taggers and parsers have been trained on this treebank, e.g. (Ramshaw and Marcus, 1995; Srinivas, 1997; Alshawi and Carter, 1994). Furthermore, context-free and unification-based grammars have been derived from the Penn Treebank (Charniak, 1996; van Genabith et al., 1999a; van Genabith et al., 1999c; van Genabith et al., 1999b). These parsers, trained or created by means of the treebank, very successfully parse unseen text with respect to correct POS tagging and chunking, and hence can be applied for enlarging the treebank. However, the situation is different for languages other than English. Ongoing projects are still in the process of building treebanks, e.g. for German (NEGRA corpus (Skut et al., 1997), now continued in the TIGER project; the German treebank in Verbmobil (Stegmann et al., 1998)), for Czech (The Prague Dependency Treebank (Hajič,

Similar resources

An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies

A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...

BANK OF ENGLISH AND BEYOND Hand-crafted parsers for functional annotation

The 200-million-word corpus of the Bank of English was annotated morphologically and syntactically using the English Constraint Grammar analyser, a rule-based shallow parser developed at the Research Unit for Computational Linguistics, University of Helsinki. We discuss the annotation system and methods used in the corpus work, as well as the theoretical assumptions of the Constraint Grammar syn...

Syntactic annotation of medieval texts: the Syntactic Reference Corpus of Medieval French (SRCMF)

This article presents the Syntactic Reference Corpus of Medieval French (SRCMF). The corpus is composed of texts taken from the two major Old French corpora, the Base de Français Médiéval and the Nouveau Corpus d'Amsterdam. This contribution describes some of the core principles of the annotation model, which is based on dependency grammar, as well as the annotation procedure and representation...

Development of Tree-bank Based Probabilistic Grammar for Urdu Language

The process comprises a hand-tagged corpus, tree annotation on paper for a large corpus, the NU-FAST Treebank in bracketed form, extraction of a CFG from the NU-FAST Treebank, evaluation of a PCFG from the CFG and then of a PDCG from the PCFG, for inspection and testing with a PROLOG parser.

Grammar Extraction and Refinement from an HPSG Corpus

Grammar learning and refinement on the basis of language resources is very appealing in comparison with manual development of formal grammar. But in order to learn a complex grammar a complex resource is needed. Thus the creation of language resources and learning of grammars from them have to be aware of each other. In this paper we define a formal basis for annotation of corpora with respect ...

Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and its Application

In Japanese, the syntactic structure of a sentence is generally represented by the relationship between phrasal units, bunsetsus in Japanese, based on a dependency grammar. In many cases, the syntactic structure of a bunsetsu is not considered in syntactic structure annotation. This paper gives the criteria and definitions of dependency relationships between words in a bunsetsu and their applic...

Publication date: 2000